1 Introduction

Climate change has been a central topic for several decades, as this global phenomenon has been affecting the Earth for many years. Scientists around the world try to understand global warming, identify its causes, and predict its future trajectory. In this project, we try to predict global temperature using time series theory. More precisely, we try to answer two main questions: is there seasonality in the temperature measurements, and what is the functional form of global warming? To do so, we first explore and transform the data in Data exploration, then model the global trend and its residuals using Generalized Least Squares and ARMA models in Modeling the temperature using GLS and ARMA. With this model, we answer the two main questions in Answering the two main questions. Additionally, we can then predict future temperatures, which is done in Prediction up to 2050; in this section, we also compare the GLS model to another modeling paradigm based on Generalized Additive Models and SARIMA models. Lastly, we state the take-home messages and further steps that would be interesting to explore in Conclusion.

2 Data exploration

2.1 Dataset Description

The time series used in this project consists of monthly temperature anomalies from January 1880 to December 2022. A temperature anomaly is a departure from a global mean value: a positive (negative) anomaly in this time series indicates that the temperature was warmer (cooler) than the global mean.

2.2 Dataset transformations

As seen above, one can identify an increasing trend. For prediction purposes, one might want to focus only on recent years. For this reason, we chose to take a subset of the time series, from \(1960\) onwards, as one can argue that there is a change point in the trend around this year, from fluctuating values to a more or less linear growth.
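The analysis in this report is presumably carried out in R; as a language-neutral illustration, here is a minimal Python sketch of the subsetting step. The anomaly values below are simulated placeholders, not the real dataset:

```python
import numpy as np

# Monthly index from January 1880 to December 2022 (143 years x 12 months).
years = np.repeat(np.arange(1880, 2023), 12)
months = np.tile(np.arange(1, 13), 2023 - 1880)

# Placeholder anomaly values (the real series would be loaded from file).
rng = np.random.default_rng(0)
anomalies = rng.normal(0.0, 0.1, size=years.size)

# Keep only observations from 1960 onwards, where the trend is roughly linear.
mask = years >= 1960
sub_years, sub_anom = years[mask], anomalies[mask]
print(sub_years.size)  # 63 years x 12 months = 756 observations
```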

3 Modeling the temperature using GLS and ARMA

3.1 Identifying a trend

From the above time series, one now has to find a model that could fit the increasing trend of the temperature anomalies. In this project, we chose to fit the trend using a few predictors: a linear, a quadratic and an exponential term in time, with time starting at \(1\) in 1960. Furthermore, we added the months as factors to capture a possible monthly seasonality in the temperatures. Since most of the Earth's landmass lies in the Northern Hemisphere, the global mean may be biased towards a northern climate behavior, which could induce such a seasonal pattern.
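To make the model matrix concrete, hereunder is a small Python sketch (an illustration, not the report's actual R code) of the design matrix with the linear, quadratic and exponential terms and the month dummies, January being the reference level:

```python
import numpy as np

n = 756                          # monthly observations, 1960-2022
t = np.arange(1, n + 1)          # time index starting at 1 in January 1960
month = ((t - 1) % 12) + 1       # calendar month of each observation

# Month dummies month2 ... month12, with January as the baseline.
dummies = np.column_stack([(month == m).astype(float) for m in range(2, 13)])

# Intercept, t, t^2, exp(t/20) and 11 month dummies -> 15 columns in total.
X = np.column_stack([np.ones(n), t, t ** 2, np.exp(t / 20), dummies])
print(X.shape)  # (756, 15)

# An OLS fit of anomalies y on X would then be:
# beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```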

From this model, we can get a first idea of whether there is seasonality and whether the temperature anomalies grow exponentially. To do this, we show the t-tests below for each coefficient.

Table 3.1: Summary of the basic model
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.053 0.023 -2.341 0.019
t 0.001 0.000 5.880 0.000
I(t^2) 0.000 0.000 8.562 0.000
exp(t/20) 0.000 0.000 -2.974 0.003
month2 0.015 0.025 0.593 0.553
month3 0.042 0.025 1.690 0.091
month4 -0.011 0.025 -0.465 0.642
month5 -0.032 0.025 -1.297 0.195
month6 -0.041 0.025 -1.678 0.094
month7 -0.046 0.025 -1.864 0.063
month8 -0.039 0.025 -1.565 0.118
month9 -0.042 0.025 -1.719 0.086
month10 -0.028 0.025 -1.155 0.248
month11 -0.027 0.025 -1.114 0.266
month12 -0.037 0.025 -1.511 0.131

As one can see above, the estimated coefficients are all close to zero, simply because the anomalies we are modeling are themselves small in magnitude, as seen in the histograms above. Furthermore, we observe a strong significance of all transformations of time, but little significance for the months, except for March, June and September (p-values around \(0.09\)) and July (\(0.06\)). However, one big drawback of this linear model is that, to invoke the Gauss-Markov theorem and the usual t-tests, we assume that the errors are iid, which is not the case here. Therefore, we introduce Generalized Least Squares:

\[Y = \textbf{X}\beta + \epsilon \text{ with } \epsilon \sim \mathcal{N}(0,\Sigma)\]

where \(Y\) is the temperature anomaly, \(X\) is the data matrix (containing the time, its transformations and dummy variables for every month), and \(\epsilon\) is the noise following a normal distribution with covariance \(\Sigma \in \mathbb{R}^{N \times N}\), which makes the noise possibly dependent between the \(N\) observations. If we knew the covariance matrix \(\Sigma\), and if it is invertible, one could define the new problem \[\tilde{Y} = \tilde{\textbf{X}}\beta + \tilde{\epsilon} \text{ with } \tilde{\epsilon} \sim \mathcal{N}(0,I_{N})\]

where \(\tilde{Y} = \Sigma^{-\frac{1}{2}}Y\), \(\tilde{\mathbf{X}} = \Sigma^{-\frac{1}{2}}\mathbf{X}\), \(\tilde{\epsilon} = \Sigma^{-\frac{1}{2}}\epsilon\), which enables us to treat this new data as iid. The goal is therefore to find the matrix \(\Sigma\). To do so, we will use time series theory to model the residuals as an ARMA process, and pass this information to a Generalized Least Squares model.
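A small numerical illustration of this whitening step, using numpy and an AR(1)-style covariance as a stand-in for the true \(\Sigma\):

```python
import numpy as np

N, rho = 50, 0.6
# AR(1)-style covariance Sigma_ij = rho^|i-j|, a simple stand-in for ARMA noise.
idx = np.arange(N)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])

# Symmetric inverse square root via the eigendecomposition of Sigma.
w, V = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T

# Whitening: the covariance of the transformed noise is the identity.
I_approx = Sigma_inv_sqrt @ Sigma @ Sigma_inv_sqrt
print(np.allclose(I_approx, np.eye(N)))  # True
```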

3.2 Finding the ARMA model for the residuals

From the initial linear model described in Identifying a trend, we obtained residuals to which we can try to fit an ARMA model in order to estimate the covariance matrix. To do so, we plot the ACF-PACF plots and look for relevant orders for the \(AR(p)\) and \(MA(q)\) parts of the model.
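For readers who want to reproduce the idea outside R, hereunder is a self-contained Python sketch of sample ACF and PACF computations (via the Durbin-Levinson recursion) on a simulated AR(2) series; the simulated coefficients are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000
# Simulate a stationary AR(2): x_t = 0.5 x_{t-1} + 0.3 x_{t-2} + e_t.
e = rng.normal(size=n)
x = np.zeros(n)
for i in range(2, n):
    x[i] = 0.5 * x[i - 1] + 0.3 * x[i - 2] + e[i]

def acf(x, nlags):
    xc = x - x.mean()
    c0 = xc @ xc / len(xc)
    return np.array([1.0] + [xc[:-k] @ xc[k:] / (len(xc) * c0)
                             for k in range(1, nlags + 1)])

def pacf(x, nlags):
    # Durbin-Levinson: the lag-k PACF is the last coefficient of the AR(k) fit.
    r = acf(x, nlags)
    phi = np.zeros((nlags + 1, nlags + 1))
    out = [1.0]
    for k in range(1, nlags + 1):
        num = r[k] - phi[k - 1, 1:k] @ r[1:k][::-1]
        den = 1.0 - phi[k - 1, 1:k] @ r[1:k]
        phi[k, k] = num / den
        phi[k, 1:k] = phi[k - 1, 1:k] - phi[k, k] * phi[k - 1, 1:k][::-1]
        out.append(phi[k, k])
    return np.array(out)

p = pacf(x, 6)
print(p.round(3))  # PACF should cut off after lag 2 for an AR(2)
```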

From the above plots, we clearly identify \(p=2\) for the AR part, as there is a clear cut-off of the PACF values after the second lag. For \(q\), it is less clear-cut, but one could argue that after the second lag there is a small drop from \(0.55\) to less than \(0.4\); we will therefore use \(q=2\). No additional seasonality has been found at lags that are multiples of \(12\) (the number of months). Now, one can look at various sub-models of the \(ARMA(p=2,q=2)\), and compare AIC values and predictive checks to find the best model.

## -1270.616 -1266.817 -1265.299
## [1] 0.03298126 0.03331282 0.03331522

From the above comparison of AIC values and prediction errors, one finds that the \(ARMA(p=2,q=2)\) is better than its submodels. Furthermore, we can check the residuals of the fitted ARMA model and assess their normality:

Except for the upper-tail, the residuals are close to the diagonal on the Q-Q Plot, and no more lags stand out from the ACF-PACF plots.

3.3 Fitting the GLS model

Now that we have a model for the trend (see Identifying a trend) and for its residuals (see Finding the ARMA model for the residuals), one can define the GLS model as described above. Hereunder is a summary of the fitted GLS model:

\[ Y = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 \exp(t/20) + \sum_{m=2}^{12} \gamma_m \mathbb{1}_{\{\text{month}=m\}} + \epsilon \quad \text{ with } \epsilon \sim ARMA(p=2,q=2) \]

Table 3.2: Summary of the GLS model
Value Std.Error t-value p-value
(Intercept) -0.049 0.044 -1.113 0.266
t 0.001 0.000 2.011 0.045
I(t^2) 0.000 0.000 3.235 0.001
exp(t/20) 0.000 0.000 -1.526 0.127
month2 0.015 0.015 0.983 0.326
month3 0.042 0.016 2.574 0.010
month4 -0.011 0.019 -0.577 0.564
month5 -0.031 0.019 -1.673 0.095
month6 -0.040 0.020 -2.034 0.042
month7 -0.045 0.019 -2.317 0.021
month8 -0.037 0.020 -1.879 0.061
month9 -0.041 0.019 -2.198 0.028
month10 -0.027 0.019 -1.441 0.150
month11 -0.026 0.016 -1.593 0.112
month12 -0.035 0.015 -2.260 0.024

From the above summary, one can see that the coefficients for the quadratic and exponential terms are essentially zero (they vanish at the displayed precision), and that the exponential coefficient is not significant, with a p-value of \(0.13\).

4 Answering the two main questions

To answer the two questions, as we can now work with iid transformed data, we can safely use significance tests. Hereunder is an ANOVA Type II test to assess variable importance in the trend model:

Table 4.1: ANOVA Test-II Table for GLS model
Df Chisq Pr(>Chisq)
t 1 4.043 0.044
I(t^2) 1 10.467 0.001
exp(t/20) 1 2.329 0.127
month 11 34.502 0.000

To confirm what we have seen in the summary of the GLS model, this ANOVA test confirms that the exponential variable is not significant for this model. We can therefore refute an exponential trend of the temperature anomalies. Furthermore, we observe a strong significance of the month factor, meaning that there is month-to-month seasonality in the temperatures, even though the collected data should be free of seasonality, as it is supposed to represent the global mean temperature around the world. As stated in the introduction, this could be due to there being more land in the Northern Hemisphere than in the Southern one.
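To illustrate the idea behind the month-factor test, hereunder is a Python sketch of the corresponding F-test under a simplifying iid-error assumption (unlike the GLS ANOVA above, which accounts for the ARMA correlation); the data and seasonal effects are simulated:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 756
t = np.arange(1, n + 1)
month = ((t - 1) % 12) + 1

# Simulated anomalies: linear trend + small monthly effects + iid noise.
seas = np.array([0, .015, .042, -.011, -.032, -.041,
                 -.046, -.039, -.042, -.028, -.027, -.037])
y = 0.001 * t + seas[month - 1] + rng.normal(0, 0.1, n)

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

dummies = np.column_stack([(month == m).astype(float) for m in range(2, 13)])
X_full = np.column_stack([np.ones(n), t, dummies])    # with month factor
X_red = np.column_stack([np.ones(n), t])              # without month factor

df1, df2 = 11, n - X_full.shape[1]
F = (rss(X_red, y) - rss(X_full, y)) / df1 / (rss(X_full, y) / df2)
p_val = stats.f.sf(F, df1, df2)
print(round(F, 2), p_val)  # a significant month effect
```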

5 Prediction up to 2050

5.1 Comparison with GAM trend and SARIMA

Instead of working with Generalized Least Squares, one can first fit the trend with a Generalized Additive Model and then find SARIMA parameters for its residuals. This can help answer the seasonality question through the seasonal orders found in the SARIMA model of the residuals, but it does not address the question of exponential growth. Therefore, this section is considered additional and only gives the reader another way to forecast temperature anomalies, without reassessing the two questions.

First, we define a Generalized Additive Model for the trend. A generalized additive model is a generalized linear model that allows non-linear relationships between the variables and the response: the response is modeled as a sum of smooth functions (usually built from spline bases) of the variables, along with any other GLM-compatible terms. For further details, see Hastie 1986. In our case, we use a GAM with a smooth function of the time \(t\), and we again add the months as factors for the monthly seasonality.
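The report's GAM is presumably fit with R's mgcv. As a simplified, unpenalized stand-in for a GAM smooth, hereunder is a cubic regression-spline fit in numpy (the knot placement and basis choice are illustrative, and mgcv additionally penalizes wiggliness):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
t = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * t) + rng.normal(0, 0.1, n)   # smooth signal + noise

# Cubic truncated-power spline basis with 8 interior knots.
knots = np.linspace(0.1, 0.9, 8)
B = np.column_stack([np.ones(n), t, t ** 2, t ** 3] +
                    [np.clip(t - k, 0, None) ** 3 for k in knots])

# Unpenalized least-squares fit of the smooth f(t).
beta, *_ = np.linalg.lstsq(B, y, rcond=None)
fitted = B @ beta
rmse = np.sqrt(np.mean((fitted - np.sin(2 * np.pi * t)) ** 2))
print(round(rmse, 3))  # small: the spline recovers the smooth trend
```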

From the residuals, we can apply the same type of analysis as previously done with the linear model residuals in Identifying a trend. Hereunder are the ACF-PACF plots of the residuals:

From the ACF plot, we first observe that the correlation structure is essentially the same as in the previous residual analysis: an exponential decay that goes below the significance threshold after lag \(5\)-\(6\), but no real cut-off. From the PACF, we observe a cut-off after lag \(2\). One could therefore choose \(p=2\) for the AR part of the ARMA model; there is no clear value for \(q\), even though we see an important decrease (yet no cut-off) at \(q=2\).

The Q-Q plot looks reasonable, and the ACF-PACF plots are consistent with white noise.

Now one can look at submodels of the ARMA model and check whether lower orders for the AR and MA components make sense:

## -1276.773 -1274.074 -1271.36 -1271.385
## [1] 0.03243841 0.03237781 0.03268235

## [1] -1278.698 -1276.773 -1275.818
## [1] 0.03237781 0.03222906 0.03243106

So the final model is SARIMA(2,0,2)(2,0,0)[12]. auto.arima would give us ARIMA(2,0,3)(1,0,0)[12] with zero mean, which is slightly different from what we have found. However, our model attains a lower AIC (\(-1278.698\) vs. \(-1277.461\) for auto.arima) and a lower residual error (\(0.03222906\) vs. \(0.03252401\)).

## [1] -1277.461 -1278.698
## [1] 0.03237781 0.03222906 0.03252401

5.2 Forecasting up to 2050

6 Conclusion